Motivations

As new and returning residents of New York, we are intrigued by the bustling rat population around us and across each borough. We are interested in exploring the impact of factors like day of the week, month, borough, latitude, longitude, and type of building has on rat sighting around New York. By understanding what influences the amount and location of rat sightings, we will be able to know which areas of the city to avoid and which areas may have the world’s best ratatouille.

Background: History of Rats in NYC

Brown rats are not indigenous to New York; they are a product of the colonization of North America and were brought over on ships in the 18th century. The brown rat is the most common variety in New York today and slowly overtook the black rat population due to their highly aggressive and dominant nature. Brown rats’ ability to tunnel through and eat just about anything has lead them to be one of the top pest in New York for millennia. A testament to their cultural relevance, the Rolling Stones 1978 record, Shattered, made a reference to the rats of New York City: “We’ve got rats on the west side”. More recently, these omnivorous creatures have even spurred the creation of a city government position titled Director of Rodent Mitigation or, colloquially, “Rat Czar”.

Initial Questions

We set out to answer the following questions:

  • How do rat sightings vary over time (month, day of the week, year)
  • How do rat sightings vary by borough and where are they concentrated?
  • What are the important factors in predicting a rat sighting?

Throughout the course of working on the project, we became interested in sightings year-over-year and included some time series plots in the analysis.

Data

Our is publicly available from Open Data NYC (https://data.cityofnewyork.us/Social-Services/Rat-Sightings/3q43-55fe), downloaded in November, 2023. The raw data contains 232,090 records of rat sightings and variables relating to geographical location, type of location, and time of sighting. In order to begin the data cleaning and analysis process, we loaded the following libraries:

  • tidyverse
  • lubridate
  • readr
  • xts
  • RColorBrewer
  • ggthemes
  • gridExtra
  • leaflet
  • highcharter
  • scales
library(tidyverse)
library(lubridate)
library(readr) 
library(xts)
library("RColorBrewer")
library("ggthemes")
library("gridExtra")
library("leaflet")
library(leaflet.extras)
library("highcharter")
library(scales)

Importing and Cleaning

We begin by importing the rat sightings data using the read_csv function, clean up the variable names with the clean_names function, and create some more useful date variables in a mutate pipeline.

rats_raw <- read_csv("./Rat_Sightings.csv", na = c("", "NA", "N/A", "Unspecified")) %>%
  janitor::clean_names() %>% 
  mutate(created_date = mdy_hms(created_date)) %>%
  mutate(sighting_year = year(created_date),
         sighting_month_num = month(created_date),
         sighting_month = month(created_date, label = TRUE, abbr = FALSE),
         sighting_day = day(created_date),
         sighting_weekday = wday(created_date, label = TRUE, abbr = FALSE)) 

There are 232,090 records of rat sightings, ranging from 2010 to 2023 and across all 5 boroughs.

Important variables to our analysis include:

  • created_date: Date of rat sighting record
  • sighting_year: Year of sighting
  • sighting_month: Month of sighting
  • sighting_day: Sighting day of the month
  • sighting_weekday: Sighting day of the week
  • location_type: Rat sighting location type (Government Building, 3+ Family Apt. Building, Construction site, etc.)
  • city: City of sighting
  • borough: Borough of sighting
  • latitude: Latitude of sighting
  • longitude: Longitude of sighting

Exploratory Analyses

We first explored how rat sightings vary over time (month, day of the week, year) and how rat sightings vary by borough. To do so, we used simple tables, bar charts, line plots, heat maps, and interactive maps. First, we visualize rat sightings by borough, different measures of time (year, month, and day of the week), and the different types of locations they are found at.

Rats Sightings by Bourough

by_borough <- rats_raw %>% 
  filter(!is.na(borough)) %>%
  group_by(borough) %>% 
  count() %>% 
  ggplot(aes(x = borough, y = n, fill = n)) + 
  geom_histogram(stat = "identity", position = "dodge") +
  theme(legend.position ='none',axis.title = element_text(),axis.text.x = element_text(size = 12)) +
  xlab("Borough") + 
  ylab("Count") +
  geom_text(aes(label = n), vjust = -0.1, size = 3.75) +
  ggtitle('Count of Rat Sightings by Borough') + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) 

by_borough

rat_borough <-
rats_raw %>%
  group_by(borough, sighting_month, sighting_year) %>%
  mutate(rat_per_month = n()) %>%
  slice(1) %>%
  select(borough, sighting_month, sighting_year, rat_per_month)

test_borough <- broom::tidy(oneway.test(rat_per_month ~ borough, data = rat_borough))

We see that Brooklyn has the highest amount of rats, followed by Manhattan, the Bronx, Queens, and Staten Island. Running a one-way ANOVA test on the number of monthly rat sightings by borough, we see that the p-value is 3.7065548^{-147} and at least one of the borough means is not equal to the others.

Rat Sightings by Year

by_year <- rats_raw %>% 
  group_by(sighting_year) %>% 
  count() %>% 
  ggplot(aes(x = sighting_year, y = n, fill = n)) + 
  geom_histogram(stat = "identity", position = "dodge") +
  theme(legend.position ='none',axis.title = element_text(),axis.text.x = element_text(size = 12)) +
  xlab("Year") + 
  ylab("Count") +
  geom_text(aes(label = n), vjust = -0.1, size = 3.75) +
  ggtitle('Count of Rat Sightings through the Years') + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) 

by_year

rat_year <-
rats_raw %>%
  group_by(sighting_month, sighting_year) %>%
  mutate(rat_per_month = n()) %>%
  slice(1) %>%
  select(sighting_month, sighting_year, rat_per_month)

test_year <- broom::tidy(oneway.test(rat_per_month ~ sighting_year, data = rat_year))

We see a substantial increase in the number of rat sightings after 2020. This increase is consistent with the city of New York’s rat media coverage and the impact of the COVID-19 pandemic. With more restaurants closed and more restaurants offering outdoor dining, rats are more likely to scavenge outside. A warmer, wetter than usual summer in 2021 also contributed to favorable rat conditions. Running a one-way ANOVA test on the number of monthly rat sightings by year, we see that the p-value is 3.9316995^{-13} so at least one of the borough means is not equal to the others.

Rat Sightings by Month

by_month <- rats_raw %>% 
  group_by(sighting_month) %>% 
  count() %>% 
  ggplot(aes(x = sighting_month, y = n, fill = n)) + 
  geom_histogram(stat = "identity", position = "dodge") +
  theme(legend.position ='none',axis.title = element_text(),axis.text.x = element_text(size = 9)) +
  xlab("Month") + 
  ylab("Count") +
  geom_text(aes(label = n), vjust = -0.1, size = 3.75) +
  ggtitle('Count of Rat Sightings by Month') + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) 

by_month

rat_monthly <-
rats_raw %>%
  group_by(sighting_month, sighting_year) %>%
  mutate(rat_per_month = n()) %>%
  slice(1) %>%
  select(sighting_month, sighting_year, rat_per_month)

test_month <- broom::tidy(oneway.test(rat_per_month ~ sighting_month, data = rat_monthly))

The most rat sightings are in the summer months with a peak in July. Sightings taper off in the fall, reaching a low in December, and then start to increase in the spring. Warmer weather is more favorable to rat survival and helps their populations grow. Running a one-way ANOVA test on the number of rat sightings by month, we see that the p-value is 7.6361763^{-8} so at least one of the month means is not equal to the others.

Rat Sightings by Day of the Week

by_day <- rats_raw %>% 
  group_by(sighting_weekday) %>% 
  count() %>% 
  ggplot(aes(x = sighting_weekday, y = n, fill = n)) + 
  geom_histogram(stat = "identity", position = "dodge") +
  theme(legend.position ='none',axis.title = element_text(),axis.text.x = element_text(size = 12)) +
  xlab("Weekday") + 
  ylab("Count") +
  geom_text(aes(label = n), vjust = -0.1, size = 4) +
  ggtitle('Count of Rat Sightings by Day of Week') + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) 

by_day

rat_dow <-
rats_raw %>%
  group_by(sighting_weekday, sighting_month, sighting_year) %>%
  mutate(rat_per_month = n()) %>%
  slice(1) %>%
  select(sighting_weekday, sighting_month, sighting_year, rat_per_month)

test_dow <- broom::tidy(oneway.test(rat_per_month ~ sighting_weekday, data = rat_dow))

Weekdays have the most rat sightings, peaking on Mondays and staying relatively high throughout the week, while weekends have much lower counts. Running a one-way ANOVA test on the number of rat sightings by day of the week, we see that the p-value is 7.6361763^{-8} so at least one of the day of the week means is not equal to the others.

Rat Sightings by Location Type

for_location_type <- rats_raw %>% 
  drop_na(location_type) %>%
  filter(location_type != "Other (Explain Below)") %>%
  group_by(location_type) %>%
  mutate(count_loc = n()) %>%
  ungroup() %>%
  filter(location_type %in% c("3+ Family Apt. Building", "1-2 Family Dwelling", "3+ Family Mixed Use Building", "Commercial Building", "Vacant Lot", "Construction Site"))

ggplot(data = for_location_type, aes(x = fct_infreq(location_type))) + 
  geom_bar() +
  theme_minimal() + 
  coord_flip() +
  labs(title = "Top 6 Location Types for Sightings",
       x = "Location Type",
       y = "Count")

rat_location <-
rats_raw %>%
  filter(location_type %in% c("3+ Family Apt. Building", "1-2 Family Dwelling", "3+ Family Mixed Use Building", "Commercial Building", "Vacant Lot", "Construction Site")) %>%
  group_by(location_type, sighting_month, sighting_year) %>%
  mutate(rat_per_month = n()) %>%
  slice(1) %>%
  select(location_type, sighting_month, sighting_year, rat_per_month)

test_location <- broom::tidy(oneway.test(rat_per_month ~ location_type, data = rat_location))

The above shows the top 6 location types for rat sightings. 3+ Family Apt. Buildings report the highest amount of rat sightings among all location types, while 1-2 Family Dwellings and 3+ Family Mixed Use Buildings report the next two highest amount of sightings. These location types are followed by commercial buildings, vacant lots, and construction sites. Running an ANOVA for monthly rat sightings against the top 6 location types, there is a p-value of 5.9127557^{-157} so at least one of the location type means is not equal to the others.

Interactive Plots

In order to display rat sightings across New York City, we opted to create an interactive heat map, showing the hot spots for rat sightings, and a time series plot.

Heat Map

Focusing in on Columbia University’s Presbyterian Hospital campus at 168th Street, there is a lack of hot spots, but the immediate surrounding areas appear to have fairly normal sightings.

## Overall Sightings Map and Heat Map

top = 40.917577 # north lat
left = -74.259090 # west long
right = -73.700272 # east long
bottom =  40.477399 # south lat


nyc = rats_raw %>%
  filter(latitude >= bottom) %>%
  filter ( latitude <= top) %>%
  filter( longitude >= left ) %>%
  filter(longitude <= right)

center_lon = median(nyc$longitude,na.rm = TRUE)
center_lat = median(nyc$latitude,na.rm = TRUE)

factpal = colorFactor("blue", nyc$n)

nyc %>%
  leaflet() %>%
  addProviderTiles("Esri.NatGeoWorldMap") %>%
  addHeatmap(lng = ~longitude, lat = ~latitude, intensity = ~(nyc$n), blur = 20, max = 0.05, radius = 15) %>%
  setView(lng=center_lon, lat=center_lat,zoom = 10)

Time series plot

Over the 13 years of data, rat sightings have increased and display a cyclical pattern each year. Rat sightings drop in the colder months and rise in the warmer months each year. Rat sightings in 2020-2023 appear to have increased

overall <- rats_raw %>% 
  group_by(sighting_year, sighting_month_num, sighting_day) %>% 
  summarize(count = n()) %>% 
  mutate(date = as.Date(paste(sighting_year, sighting_month_num, sighting_day, sep = "-")))

time_series = xts(overall$count , order.by= overall$date)

hchart(time_series, name = "Rat Sightings") %>% 
  hc_add_theme(hc_theme_darkunica()) %>%
  hc_credits(enabled = TRUE, text = "Sources: City of New York", style = list(fontSize = "12px")) %>%
  hc_title(text = "Time Series of NYC Rat Sightings") %>%
  hc_legend(enabled = TRUE)

Additional Analyses

Modeling 1

Insert

Modeling 2

Insert

Discussion

Our project aimed to investigate the factors that contribute to rat sightings in New York City. We were interested in how rat sightings vary by different measures of time, borough, location type and what variables are most important.

To investigate the impact of these variables, we created a series of exploratory plots that displaying the relationship between rat sightings and year, month, day of the week, borough, and location type. We followed each up with an ANOVA test to test the relationship between monthly rat sightings and each group; all were statistically significant. We learned that rat sightings have increased year-to-year, with large increases since 2020. This is likely due to more rats scavenging outside due to pandemic-induced restaurant closures and summers becoming significantly warmer and wetter than summers in prior years. We also learned that Brooklyn has the most rat sightings of all boroughs and apartment buildings with 3 or more units have the most rat sightings of all location types.

In addition to our data visualizations, we conducted linear models predicting latitude and longitude & rat sightings per month. We found 4 predictors that are helpful in predicting monthly rat sightings (location_type, sighting_year, borough, and season), with coefficients in the expected directions according to our exploratory plots.

Through this analysis, we now understand some of the factors that influence the amount and location of rat sightings in New York City. It became clear that there’s no avoiding our furry neighbors.